In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Read data. There are two datasets: Planon and Master. The latter is the EIS data nomenclature that was created. Master is made up of two subsets: loggers and meters. Loggers are sometimes called controllers, and meters are sometimes called sensors. In rare cases meters or sensors are also called channels.
In [2]:
planon=pd.read_excel('Data Loggers.xlsx',index_col = 'Code')
master_loggerscontrollers = pd.read_csv('LoggersControllers.csv', index_col = 'Asset Code')
master_meterssensors = pd.read_csv('MetersSensors.csv', encoding = 'macroman', index_col = 'Asset Code')
Unify the indices: uppercase everything and strip trailing spaces.
In [7]:
planon.index=[str(i).upper().strip() for i in planon.index]
master_loggerscontrollers.index=[str(i).upper().strip() for i in master_loggerscontrollers.index]
master_meterssensors.index=[str(i).upper().strip() for i in master_meterssensors.index]
Drop duplicates (shouldn't be any)
In [8]:
planon.drop_duplicates(inplace=True)
master_loggerscontrollers.drop_duplicates(inplace=True)
master_meterssensors.drop_duplicates(inplace=True)
Split Planon import into loggers and meters
Drop duplicates (shouldn't be any)
In [9]:
# Split the Planon file into 2, one for loggers & controllers, and one for meters & sensors.
planon_loggerscontrollers = planon.loc[(planon['Classification Group'] == 'EN.EN4 BMS Controller') | (planon['Classification Group'] == 'EN.EN1 Data Logger')]
planon_meterssensors = planon.loc[(planon['Classification Group'] == 'EN.EN2 Energy Meter') | (planon['Classification Group'] == 'EN.EN3 Energy Sensor')]
planon_loggerscontrollers.drop_duplicates(inplace=True)
planon_meterssensors.drop_duplicates(inplace=True)
Is the index unique? Show the number of duplicates in the index.
In [10]:
len(planon_loggerscontrollers.index[planon_loggerscontrollers.index.duplicated()])
Out[10]:
In [11]:
len(planon_meterssensors.index[planon_meterssensors.index.duplicated()])
Out[11]:
Meters are not unique. This is because of the spaces served. This is OK for now; we will deal with duplicates at the comparison stage. The same is true for loggers, in the unlikely event that there are duplicates in the future.
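As a quick check (a sketch, not part of the original pipeline), counting the Planon rows per meter code shows how many spaces each duplicated meter serves:
# Sketch: rows per meter code; values above 1 reflect multiple spaces served
space_counts = planon_meterssensors.index.value_counts()
space_counts[space_counts > 1].head(10)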
In [12]:
planon_meterssensors.head(3)
Out[12]:
Create a list of all buildings present in the Planon export. These are the buildings to check the Master data against.
In [13]:
buildings=set(planon_meterssensors['BuildingNo.'])
buildings
Out[13]:
Create a dataframe slice for validation from master_meterssensors that contains only the buildings listed in buildings. Save this new slice into master_meterssensors_for_validation. This is done by creating sub-slices of the dataframe for each building, then concatenating them all together.
In [14]:
master_meterssensors_for_validation = \
pd.concat([master_meterssensors.loc[master_meterssensors['Building Code'] == building] \
for building in buildings])
master_meterssensors_for_validation.head(2)
Out[14]:
In [15]:
#alternative method
master_meterssensors_for_validation2 = \
master_meterssensors[master_meterssensors['Building Code'].isin(buildings)]
master_meterssensors_for_validation2.head(2)
Out[15]:
Planon sensors are not unique because of the spaces-served convention in the two data architectures. The Planon architecture devotes a new line to each space served, hence the non-unique index. The Master architecture lists all the spaces only once, as a list, therefore it has a unique index. We will need to take this into account and create a matching dataframe out of Planon for comparison, with a unique index.
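One possible way to build such a unique-index Planon frame (a sketch only; 'Space.Space number' is an assumed column name from the Planon export, not verified here) is to aggregate the spaces served into a list per meter code, mimicking the Master convention:
# Sketch: collapse the one-row-per-space convention into one row per meter code.
# 'Space.Space number' is an assumed Planon column name.
planon_unique = (planon_meterssensors
                 .groupby(planon_meterssensors.index)
                 .agg({'Space.Space number': lambda s: sorted(set(s.dropna().astype(str)))}))
planon_unique.head()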
In [16]:
len(master_meterssensors_for_validation)
Out[16]:
In [17]:
len(planon_meterssensors)-len(planon_meterssensors.index[planon_meterssensors.index.duplicated()])
Out[17]:
Sort the datasets by index for easier comparison.
In [18]:
master_meterssensors_for_validation.sort_index(inplace=True)
planon_meterssensors.sort_index(inplace=True)
In [19]:
planon_meterssensors.T
Out[19]:
In [20]:
master_meterssensors_for_validation.T
Out[20]:
Create a dictionary that maps Planon column names onto Master. From Nicola: Building code and Building name are implicitly included. Logger Serial Number, IP or MAC would be essential to include, as well as Make and Model. Additional Location Info is not essential but would be useful to have. Locations (Locations.Space.Space number and Space Name) are included in the Planon export, but this is their only viable data source, therefore they are not validated against.
In [21]:
#Planon:Master
meters_match_dict={
"BuildingNo.":"Building Code",
"Building":"Building Name",
"Description":"Description",
"EIS ID":"Logger Channel",
"Tenant Meter.Name":"Tenant meter",
"Fiscal Meter.Name":"Fiscal meter"
}
Filter both dataframes down to these columns, then remove duplicates. Currently this loses the spaces-served information, but it also gives the Planon dataframe a unique index, bringing the two dataframes closer to each other. If spaces are ever included explicitly in the comparison (if we want to, rather than just trusting the Planon space mapping), this needs to be modified.
In [22]:
master_meterssensors_for_validation_filtered=master_meterssensors_for_validation[list(meters_match_dict.values())]
planon_meterssensors_filtered=planon_meterssensors[list(meters_match_dict.keys())]
In [23]:
master_meterssensors_for_validation_filtered.head(2)
Out[23]:
In [24]:
planon_meterssensors_filtered.head(2)
Out[24]:
Unify the headers and drop duplicates (bear in mind the spaces argument; this is where it would need to be brought back in the future!).
In [25]:
planon_meterssensors_filtered.columns=[meters_match_dict[i] for i in planon_meterssensors_filtered]
In [26]:
planon_meterssensors_filtered.drop_duplicates(inplace=True)
master_meterssensors_for_validation_filtered.drop_duplicates(inplace=True)
In [27]:
planon_meterssensors_filtered.head(2)
Out[27]:
The Fiscal/Tenant meter columns need converting to booleans: Planon uses Yes/No, Master uses 1/0.
In [28]:
planon_meterssensors_filtered['Fiscal meter']=planon_meterssensors_filtered['Fiscal meter'].isin(['Yes'])
planon_meterssensors_filtered['Tenant meter']=planon_meterssensors_filtered['Tenant meter'].isin(['Yes'])
master_meterssensors_for_validation_filtered['Fiscal meter']=master_meterssensors_for_validation_filtered['Fiscal meter'].isin([1])
master_meterssensors_for_validation_filtered['Tenant meter']=master_meterssensors_for_validation_filtered['Tenant meter'].isin([1])
In [29]:
master_meterssensors_for_validation_filtered.head(2)
Out[29]:
In [30]:
planon_meterssensors_filtered.head(2)
Out[30]:
Cross-check missing meters
In [32]:
a=np.sort(list(set(planon_meterssensors_filtered.index)))
b=np.sort(list(set(master_meterssensors_for_validation_filtered.index)))
meterssensors_not_in_planon=[]
for i in b:
    if i not in a:
        print(i+',', end=" ")
        meterssensors_not_in_planon.append(i)
print('\n\nMeters in Master, but not in Planon:',
len(meterssensors_not_in_planon),'/',len(b),':',
round(len(meterssensors_not_in_planon)/len(b)*100,3),'%')
In [33]:
a=np.sort(list(set(planon_meterssensors_filtered.index)))
b=np.sort(list(set(master_meterssensors_for_validation_filtered.index)))
meterssensors_not_in_master=[]
for i in a:
    if i not in b:
        print(i+',', end=" ")
        meterssensors_not_in_master.append(i)
print('\n\nMeters in Planon, not in Master:',
len(meterssensors_not_in_master),'/',len(a),':',
round(len(meterssensors_not_in_master)/len(a)*100,3),'%')
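The same cross-check can be written more concisely with set differences (an alternative sketch, equivalent to the loops above):
# Alternative: set differences instead of explicit membership loops
meterssensors_not_in_planon_alt = np.setdiff1d(b, a)  # in Master, not in Planon
meterssensors_not_in_master_alt = np.setdiff1d(a, b)  # in Planon, not in Master
print(len(meterssensors_not_in_planon_alt), len(meterssensors_not_in_master_alt))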
Check for entries that are duplicated in the index but are not duplicates over the entire row.
In [34]:
print(len(planon_meterssensors_filtered.index))
print(len(set(planon_meterssensors_filtered.index)))
print(len(master_meterssensors_for_validation_filtered.index))
print(len(set(master_meterssensors_for_validation_filtered.index)))
In [35]:
master_meterssensors_for_validation_filtered[master_meterssensors_for_validation_filtered.index.duplicated()]
Out[35]:
The duplicates are the NaNs. Remove these for now. Could revisit later to do an index-less comparison, only over row contents.
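For reference, such an index-less comparison could be sketched as an outer merge over the row contents (a sketch only, not used further here; astype(str) avoids dtype mismatches between the two sources):
# Sketch: compare purely on row contents, ignoring the Asset Code index
contents_check = planon_meterssensors_filtered.astype(str).merge(
    master_meterssensors_for_validation_filtered.astype(str),
    how='outer', indicator=True)
contents_check['_merge'].value_counts()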
In [66]:
good_index=[i for i in master_meterssensors_for_validation_filtered.index if str(i).lower().strip()!='nan']
master_meterssensors_for_validation_filtered=master_meterssensors_for_validation_filtered.loc[good_index]
master_meterssensors_for_validation_filtered.drop_duplicates(inplace=True)
In [69]:
len(planon_meterssensors_filtered)
Out[69]:
In [70]:
len(master_meterssensors_for_validation_filtered)
Out[70]:
Do the comparison only on common indices. Need to revisit and identify the cause of the missing meters, both ways (5 Planon->Meters and 30 Meters->Planon in this example).
In [71]:
comon_index=list(set(master_meterssensors_for_validation_filtered.index).intersection(set(planon_meterssensors_filtered.index)))
In [73]:
len(comon_index)
Out[73]:
In [76]:
master_meterssensors_for_validation_intersected=master_meterssensors_for_validation_filtered.loc[comon_index].sort_index()
planon_meterssensors_intersected=planon_meterssensors_filtered.loc[comon_index].sort_index()
In [77]:
len(master_meterssensors_for_validation_intersected)
Out[77]:
In [78]:
len(planon_meterssensors_intersected)
Out[78]:
There are still duplicate indices. For now we just drop them and keep the first occurrence.
In [80]:
master_meterssensors_for_validation_intersected = master_meterssensors_for_validation_intersected[~master_meterssensors_for_validation_intersected.index.duplicated(keep='first')]
In [81]:
master_meterssensors_for_validation_intersected.head(2)
Out[81]:
In [82]:
planon_meterssensors_intersected.head(2)
Out[82]:
In [83]:
planon_meterssensors_intersected==master_meterssensors_for_validation_intersected
Out[83]:
In [84]:
np.all(planon_meterssensors_intersected==master_meterssensors_for_validation_intersected)
Out[84]:
Number of cells matching
In [85]:
(planon_meterssensors_intersected==master_meterssensors_for_validation_intersected).sum()
Out[85]:
Percentage matching
In [86]:
(planon_meterssensors_intersected==master_meterssensors_for_validation_intersected).sum()/\
len(planon_meterssensors_intersected)*100
Out[86]:
In [87]:
((planon_meterssensors_intersected==master_meterssensors_for_validation_intersected).sum()/\
len(planon_meterssensors_intersected)*100).plot(kind='bar')
Out[87]:
In [88]:
df=pd.DataFrame((planon_meterssensors_intersected.T==master_meterssensors_for_validation_intersected.T).sum())
df
Out[88]:
In [89]:
df=pd.DataFrame((planon_meterssensors_intersected.T==master_meterssensors_for_validation_intersected.T).sum()/\
len(planon_meterssensors_intersected.T)*100)
df[df[0]<100]
Out[89]:
In [90]:
df[df[0]<100].plot(kind='bar')
Out[90]:
Not all of the dataframe matches. Let us do some basic string formatting; maybe that helps.
In [91]:
sum(planon_meterssensors_intersected['Description']!=master_meterssensors_for_validation_intersected['Description'])
Out[91]:
In [92]:
# Normalise descriptions: lowercase, strip, collapse repeated spaces
planon_meterssensors_intersected['Description']=[str(s).lower().strip().replace('  ',' ').replace('  ',' ') for s in planon_meterssensors_intersected['Description'].values]
master_meterssensors_for_validation_intersected['Description']=[str(s).lower().strip().replace('  ',' ').replace('  ',' ') for s in master_meterssensors_for_validation_intersected['Description'].values]
In [93]:
sum(planon_meterssensors_intersected['Description']!=master_meterssensors_for_validation_intersected['Description'])
Out[93]:
Some errors are fixed, some remain. Let's see which ones:
In [95]:
for i in planon_meterssensors_intersected[planon_meterssensors_intersected['Description']!=master_meterssensors_for_validation_intersected['Description']].index:
    print(i,'\t\tPlanon:',planon_meterssensors_intersected.loc[i]['Description'],'\t\tMaster:',master_meterssensors_for_validation_intersected.loc[i]['Description'])
Let us repeat the exercise for Logger Channel. Cross-validate and flag as a highly likely error where both mismatch (see the sketch after the comparison below).
In [96]:
sum(planon_meterssensors_intersected['Logger Channel']!=master_meterssensors_for_validation_intersected['Logger Channel'])
Out[96]:
In [97]:
# Normalise logger channels: lowercase, strip, collapse repeated spaces
planon_meterssensors_intersected['Logger Channel']=[str(s).lower().strip().replace('  ',' ').replace('  ',' ') for s in planon_meterssensors_intersected['Logger Channel'].values]
master_meterssensors_for_validation_intersected['Logger Channel']=[str(s).lower().strip().replace('  ',' ').replace('  ',' ') for s in master_meterssensors_for_validation_intersected['Logger Channel'].values]
In [98]:
sum(planon_meterssensors_intersected['Logger Channel']!=master_meterssensors_for_validation_intersected['Logger Channel'])
Out[98]:
All errors fixed on logger channels.
In [99]:
for i in planon_meterssensors_intersected[planon_meterssensors_intersected['Logger Channel']!=master_meterssensors_for_validation_intersected['Logger Channel']].index:
    print(i,'\t\tPlanon:',planon_meterssensors_intersected.loc[i]['Logger Channel'],'\t\tMaster:',master_meterssensors_for_validation_intersected.loc[i]['Logger Channel'])
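A possible cross-validation flag, as mentioned above (a sketch only): mark rows where both the Description and the Logger Channel disagree as highly likely errors.
# Sketch: rows where both Description and Logger Channel mismatch
desc_mismatch = planon_meterssensors_intersected['Description'] != \
                master_meterssensors_for_validation_intersected['Description']
chan_mismatch = planon_meterssensors_intersected['Logger Channel'] != \
                master_meterssensors_for_validation_intersected['Logger Channel']
likely_errors = planon_meterssensors_intersected[desc_mismatch & chan_mismatch]
likely_errors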
New error percentage:
In [100]:
(planon_meterssensors_intersected!=master_meterssensors_for_validation_intersected).sum()/\
len(planon_meterssensors_intersected)*100
Out[100]:
In [101]:
master_loggerscontrollers_for_validation = \
pd.concat([master_loggerscontrollers.loc[master_loggerscontrollers['Building Code'] == building] \
for building in buildings])
master_loggerscontrollers_for_validation.head(2)
Out[101]:
In [102]:
len(master_loggerscontrollers_for_validation)
Out[102]:
In [103]:
len(planon_loggerscontrollers)-len(planon_loggerscontrollers.index[planon_loggerscontrollers.index.duplicated()])
Out[103]:
In [104]:
master_loggerscontrollers_for_validation.sort_index(inplace=True)
planon_loggerscontrollers.sort_index(inplace=True)
In [105]:
planon_loggerscontrollers.T
Out[105]:
In [106]:
master_loggerscontrollers_for_validation.T
Out[106]:
Create a dictionary that maps Planon column names onto Master. From Nicola: Building code and Building name are implicitly included. Logger IP or MAC would be essential to include, as well as Make and Model. Additional Location Info is not essential but would be useful to have. Locations (Locations.Space.Space number and Space Name) are included in the Planon export, but this is their only viable data source, therefore they are not validated against.
In [107]:
#Planon:Master
loggers_match_dict={
"BuildingNo.":"Building Code",
"Building":"Building Name",
"Description":"Description",
"EIS ID":"Logger Serial Number",
"Make":"Make",
"Model":"Model"
}
In [108]:
master_loggerscontrollers_for_validation_filtered=master_loggerscontrollers_for_validation[list(loggers_match_dict.values())]
planon_loggerscontrollers_filtered=planon_loggerscontrollers[list(loggers_match_dict.keys())]
In [109]:
master_loggerscontrollers_for_validation_filtered.head(2)
Out[109]:
In [110]:
planon_loggerscontrollers_filtered.head(2)
Out[110]:
In [111]:
planon_loggerscontrollers_filtered.columns=[loggers_match_dict[i] for i in planon_loggerscontrollers_filtered]
In [112]:
planon_loggerscontrollers_filtered.drop_duplicates(inplace=True)
master_loggerscontrollers_for_validation_filtered.drop_duplicates(inplace=True)
In [113]:
planon_loggerscontrollers_filtered.head(2)
Out[113]:
In [114]:
master_loggerscontrollers_for_validation_filtered.head(2)
Out[114]:
In [115]:
a=np.sort(list(set(planon_loggerscontrollers_filtered.index)))
b=np.sort(list(set(master_loggerscontrollers_for_validation_filtered.index)))
loggerscontrollers_not_in_planon=[]
for i in b:
    if i not in a:
        print(i+',', end=" ")
        loggerscontrollers_not_in_planon.append(i)
print('\n\nLoggers in Master, but not in Planon:',
len(loggerscontrollers_not_in_planon),'/',len(b),':',
round(len(loggerscontrollers_not_in_planon)/len(b)*100,3),'%')
In [116]:
a=np.sort(list(set(planon_loggerscontrollers_filtered.index)))
b=np.sort(list(set(master_loggerscontrollers_for_validation_filtered.index)))
loggerscontrollers_not_in_master=[]
for i in a:
    if i not in b:
        print(i+',', end=" ")
        loggerscontrollers_not_in_master.append(i)
print('\n\nLoggers in Planon, not in Master:',
len(loggerscontrollers_not_in_master),'/',len(a),':',
round(len(loggerscontrollers_not_in_master)/len(a)*100,3),'%')
In [117]:
print(len(planon_loggerscontrollers_filtered.index))
print(len(set(planon_loggerscontrollers_filtered.index)))
print(len(master_loggerscontrollers_for_validation_filtered.index))
print(len(set(master_loggerscontrollers_for_validation_filtered.index)))
In [118]:
master_loggerscontrollers_for_validation_filtered[master_loggerscontrollers_for_validation_filtered.index.duplicated()]
Out[118]:
In [119]:
comon_index=list(set(master_loggerscontrollers_for_validation_filtered.index).intersection(set(planon_loggerscontrollers_filtered.index)))
In [120]:
master_loggerscontrollers_for_validation_intersected=master_loggerscontrollers_for_validation_filtered.loc[comon_index].sort_index()
planon_loggerscontrollers_intersected=planon_loggerscontrollers_filtered.loc[comon_index].sort_index()
In [121]:
master_loggerscontrollers_for_validation_intersected.head(2)
Out[121]:
In [122]:
planon_loggerscontrollers_intersected.head(2)
Out[122]:
In [123]:
planon_loggerscontrollers_intersected==master_loggerscontrollers_for_validation_intersected
Out[123]:
Loggers matching
In [124]:
(planon_loggerscontrollers_intersected==master_loggerscontrollers_for_validation_intersected).sum()
Out[124]:
Percentage matching
In [125]:
(planon_loggerscontrollers_intersected==master_loggerscontrollers_for_validation_intersected).sum()/\
len(planon_loggerscontrollers_intersected)*100
Out[125]:
In [126]:
((planon_loggerscontrollers_intersected==master_loggerscontrollers_for_validation_intersected).sum()/\
len(planon_loggerscontrollers_intersected)*100).plot(kind='bar')
Out[126]:
Loggers not matching on Building Name.
In [127]:
sum(planon_loggerscontrollers_intersected['Building Name']!=master_loggerscontrollers_for_validation_intersected['Building Name'])
Out[127]:
In [128]:
# Normalise building names: lowercase, strip, collapse repeated spaces
planon_loggerscontrollers_intersected['Building Name']=[str(s).lower().strip().replace('  ',' ').replace('  ',' ') for s in planon_loggerscontrollers_intersected['Building Name'].values]
master_loggerscontrollers_for_validation_intersected['Building Name']=[str(s).lower().strip().replace('  ',' ').replace('  ',' ') for s in master_loggerscontrollers_for_validation_intersected['Building Name'].values]
In [129]:
sum(planon_loggerscontrollers_intersected['Building Name']!=master_loggerscontrollers_for_validation_intersected['Building Name'])
Out[129]:
That didn't help.
In [130]:
for i in planon_loggerscontrollers_intersected[planon_loggerscontrollers_intersected['Building Name']!=master_loggerscontrollers_for_validation_intersected['Building Name']].index:
    print(i,'\t\tPlanon:',planon_loggerscontrollers_intersected.loc[i]['Building Name'],'\t\tMaster:',master_loggerscontrollers_for_validation_intersected.loc[i]['Building Name'])
Follow up with a lexical distance comparison; that would flag this as a match (see the sketch below).
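Such a lexical comparison could be sketched with difflib from the standard library, treating near-identical names as matches above some similarity threshold (the 0.8 cut-off here is an assumption, to be tuned):
from difflib import SequenceMatcher
# Sketch: flag building-name pairs that differ but are lexically very close
threshold = 0.8  # assumed cut-off
for i in planon_loggerscontrollers_intersected.index:
    p = planon_loggerscontrollers_intersected.loc[i]['Building Name']
    m = master_loggerscontrollers_for_validation_intersected.loc[i]['Building Name']
    if p != m and SequenceMatcher(None, p, m).ratio() >= threshold:
        print(i, 'near-match:', p, '~', m)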
Loggers not matching on Serial Number.
In [131]:
sum(planon_loggerscontrollers_intersected['Logger Serial Number']!=master_loggerscontrollers_for_validation_intersected['Logger Serial Number'])
Out[131]:
In [132]:
# Normalise serial numbers: lowercase, strip, collapse repeated spaces, drop braces
planon_loggerscontrollers_intersected['Logger Serial Number']=[str(s).lower().strip().replace('  ',' ').replace('  ',' ').replace('{','').replace('}','') for s in planon_loggerscontrollers_intersected['Logger Serial Number'].values]
master_loggerscontrollers_for_validation_intersected['Logger Serial Number']=[str(s).lower().strip().replace('  ',' ').replace('  ',' ').replace('{','').replace('}','') for s in master_loggerscontrollers_for_validation_intersected['Logger Serial Number'].values]
In [133]:
sum(planon_loggerscontrollers_intersected['Logger Serial Number']!=master_loggerscontrollers_for_validation_intersected['Logger Serial Number'])
Out[133]:
In [134]:
for i in planon_loggerscontrollers_intersected[planon_loggerscontrollers_intersected['Logger Serial Number']!=master_loggerscontrollers_for_validation_intersected['Logger Serial Number']].index:
    print(i,'\t\tPlanon:',planon_loggerscontrollers_intersected.loc[i]['Logger Serial Number'],'\t\tMaster:',master_loggerscontrollers_for_validation_intersected.loc[i]['Logger Serial Number'])
Technically the same, but there is a number format error. Compare based on the float value; if the values match, replace one of them. This needs to be amended, as it will throw a cannot convert to float exception if strings are left in from the previous step; a more defensive sketch follows the cell below.
In [135]:
z1=[]
z2=[]
for i in planon_loggerscontrollers_intersected.index:
    if planon_loggerscontrollers_intersected.loc[i]['Logger Serial Number']!=master_loggerscontrollers_for_validation_intersected.loc[i]['Logger Serial Number']:
        if float(planon_loggerscontrollers_intersected.loc[i]['Logger Serial Number'])==\
           float(master_loggerscontrollers_for_validation_intersected.loc[i]['Logger Serial Number']):
            z1.append(str(int(float(planon_loggerscontrollers_intersected.loc[i]['Logger Serial Number']))))
            z2.append(str(int(float(planon_loggerscontrollers_intersected.loc[i]['Logger Serial Number']))))
        else:
            z1.append(planon_loggerscontrollers_intersected.loc[i]['Logger Serial Number'])
            z2.append(master_loggerscontrollers_for_validation_intersected.loc[i]['Logger Serial Number'])
    else:
        z1.append(planon_loggerscontrollers_intersected.loc[i]['Logger Serial Number'])
        z2.append(planon_loggerscontrollers_intersected.loc[i]['Logger Serial Number'])
planon_loggerscontrollers_intersected['Logger Serial Number']=z1
master_loggerscontrollers_for_validation_intersected['Logger Serial Number']=z2
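A more defensive variant (a sketch, not the original code) would guard the float conversion so that genuinely non-numeric serial numbers fall through to the plain string comparison instead of raising the exception mentioned above:
# Sketch: compare serial numbers numerically where possible, without raising on strings
def _as_float(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

z1, z2 = [], []
for i in planon_loggerscontrollers_intersected.index:
    p = planon_loggerscontrollers_intersected.loc[i]['Logger Serial Number']
    m = master_loggerscontrollers_for_validation_intersected.loc[i]['Logger Serial Number']
    fp, fm = _as_float(p), _as_float(m)
    if p != m and fp is not None and fm is not None and fp == fm:
        p = m = str(int(fp))  # numerically equal: unify the representation
    z1.append(p)
    z2.append(m)
planon_loggerscontrollers_intersected['Logger Serial Number'] = z1
master_loggerscontrollers_for_validation_intersected['Logger Serial Number'] = z2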
In [136]:
for i in planon_loggerscontrollers_intersected[planon_loggerscontrollers_intersected['Logger Serial Number']!=master_loggerscontrollers_for_validation_intersected['Logger Serial Number']].index:
    print(i,'\t\tPlanon:',planon_loggerscontrollers_intersected.loc[i]['Logger Serial Number'],'\t\tMaster:',master_loggerscontrollers_for_validation_intersected.loc[i]['Logger Serial Number'])
New error percentage:
In [137]:
(planon_loggerscontrollers_intersected!=master_loggerscontrollers_for_validation_intersected).sum()/\
len(planon_loggerscontrollers_intersected)*100
Out[137]:
(Bearing in mind the above, this is technically 0.)
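To make that explicit, one could recompute the error percentages while discounting Building Name pairs that a lexical comparison would accept (a sketch; the 0.8 cut-off is an assumption):
from difflib import SequenceMatcher
# Sketch: drop mismatches explained by near-identical building names, then recompute
residual = planon_loggerscontrollers_intersected != master_loggerscontrollers_for_validation_intersected
for i in planon_loggerscontrollers_intersected.index:
    p = planon_loggerscontrollers_intersected.loc[i]['Building Name']
    m = master_loggerscontrollers_for_validation_intersected.loc[i]['Building Name']
    if p != m and SequenceMatcher(None, p, m).ratio() >= 0.8:  # assumed cut-off
        residual.loc[i, 'Building Name'] = False
residual.sum()/len(residual)*100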